Scalable K-Means++
Abstract
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
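To make the pass-count argument in the abstract concrete, here is a minimal sketch in plain Python. The function `kmeanspp_seed` follows the standard k-means++ seeding rule (each new center is drawn with probability proportional to its squared distance from the nearest existing center), which needs one full pass over the data per center. The function `oversample_seed` is only an assumed illustration of how several candidates could be picked in a single pass; its `oversample` and `rounds` parameters, and all function names here, are hypothetical and not taken from the abstract. The candidate set it returns usually holds more than k points and would still have to be reduced to k centers, a step this sketch leaves out.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_sq_dists(points, centers):
    """One pass over the data: squared distance of each point to its nearest center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def kmeanspp_seed(points, k, rng):
    """Sequential k-means++ seeding: one weighted draw per pass over the data."""
    centers = [rng.choice(points)]  # first center is chosen uniformly at random
    while len(centers) < k:
        d2 = nearest_sq_dists(points, centers)  # a full pass for every new center
        if sum(d2) == 0:
            break  # every point already coincides with a center
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def oversample_seed(points, oversample, rounds, rng):
    """Assumed oversampling variant: each pass may add many candidates at once."""
    centers = [rng.choice(points)]
    for _ in range(rounds):
        d2 = nearest_sq_dists(points, centers)
        cost = sum(d2)
        if cost == 0:
            break
        # Independent draws: keep a point with probability min(1, oversample * d^2 / cost).
        picked = [p for p, d in zip(points, d2)
                  if rng.random() < min(1.0, oversample * d / cost)]
        centers.extend(picked)
    return centers


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(float(i), float(i % 7)) for i in range(200)]
    print(len(kmeanspp_seed(data, 5, rng)), "centers after 5 sequential passes")
    print(len(oversample_seed(data, 10, 3, rng)), "candidates after 3 passes")
```

Because the draws inside one oversampling pass are independent, the per-point work can be split across machines, whereas each k-means++ draw depends on all previous centers.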
Related resources
Scalable Kernel k-Means via Centroid Approximation
Although kernel k-means is central for clustering complex data such as images and texts by implicit feature space embedding, its practicality is limited by the quadratic computational complexity. In this paper, we present a novel technique based on scalable centroid approximation that accelerates kernel k-means down to a sub-quadratic complexity. We prove near-optimality of our algorithm, and e...
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
The kernel k-means is an effective method for data clustering which extends the commonly-used k-means algorithm to work on a similarity matrix over complex data structures. It is, however, computationally very complex as it requires the complete kernel matrix to be calculated and stored. Further, its kernelized nature hinders the parallelization of its computations on modern scalable infrastruc...
Communication Challenges in Cloud K-means
This paper studies how parallel machine learning algorithms can be implemented on top of the Microsoft Windows Azure cloud computing platform. More specifically, we design efficient storage-based communication mechanisms that lead to a scalable implementation of K-means.
Fast, single-pass K-means algorithms
We discuss the issue of how well K-means scales to large databases. We evaluate the performance of our implementation of a scalable variant of K-means, from Bradley, Fayyad and Reina (1998b), that uses several, fairly complicated, types of compression to fit points into a fixed size buffer, which is then used for the clustering. The running time of the algorithm and the quality of the resulting clus...
Wasserstein k-means++ for Cloud Regime Histogram Clustering
Much work has sought to discern the different types of cloud regimes, typically via Euclidean k-means clustering of histograms. However, these methods ignore the underlying similarity structure of cloud types. Wasserstein k-means clustering is a promising candidate for utilizing this structure during clustering, but existing algorithms do not scale well and lack the quality guarantees of the Eu...
Scalable Embeddings for Kernel Clustering on MapReduce
There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly-used data clustering method, having gained popularity for its effectiveness on various data sets and ease of implementation on different computing architectures. It assumes, however, t...
Journal title:
- PVLDB
Volume 5
Publication date: 2012